This data set contains information about loans. It has more than 113,000 observations and 81 variables, covering the loan itself, the borrower, the lenders and the investors. Studying it should help clarify which factors affect loan agreements. For my own benefit, I hope to better understand the factors that could help me obtain a comfortable mortgage and pay it off, so I can finally own a house after wishing for one for years.
I started by removing ambiguous employment statuses from the data set. I also removed outliers, which made the charts very difficult to read. In addition, I created a data frame of means and medians to make this information easier to compare.
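The cleaning steps described above could look roughly like the following base R sketch. The toy data, the `EmploymentStatus` column, the list of statuses treated as ambiguous, and the 99th-percentile cutoff are all assumptions made for illustration; the exact filters used for this analysis are not shown here.

```r
# Toy stand-in for the loans data. Values and the EmploymentStatus column
# are assumptions for illustration only.
loans <- data.frame(
  EmploymentStatus = c("Employed", "Other", "Full-time", "Not available",
                       "Self-employed", "Employed", "Full-time", "Employed"),
  StatedMonthlyIncome = c(4200, 3100, 5200, 2500, 6100, 3900, 4800, 250000),
  stringsAsFactors = FALSE
)

# 1. Drop statuses treated as ambiguous (this list is a guess).
ambiguous <- c("Other", "Not available")
loans2 <- loans[!(loans$EmploymentStatus %in% ambiguous), ]

# 2. Remove extreme incomes. The 99th percentile is an arbitrary cutoff.
cutoff <- quantile(loans2$StatedMonthlyIncome, 0.99)
loans2 <- loans2[loans2$StatedMonthlyIncome <= cutoff, ]

# 3. Build a data frame of means and medians per employment status.
means <- aggregate(StatedMonthlyIncome ~ EmploymentStatus,
                   data = loans2, FUN = mean)
medians <- aggregate(StatedMonthlyIncome ~ EmploymentStatus,
                     data = loans2, FUN = median)
income_stats <- merge(means, medians, by = "EmploymentStatus",
                      suffixes = c("_mean", "_median"))
print(income_stats)
```

Keeping the means and medians side by side makes it easy to spot groups where a few large values pull the mean away from the median.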
## loans2$IncomeRange: $0
## NULL
## --------------------------------------------------------
## loans2$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.083 1166.667 1583.333 1427.846 1833.333 10000.000
## --------------------------------------------------------
## loans2$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.083 9083.333 10166.667 11055.064 12500.000 20825.000
## --------------------------------------------------------
## loans2$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.083 2666.667 3166.667 3139.743 3593.458 9500.000
## --------------------------------------------------------
## loans2$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.083 4500.000 5000.000 5025.569 5500.000 9688.917
## --------------------------------------------------------
## loans2$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4141 6583 7000 7056 7500 13333
## --------------------------------------------------------
## loans2$IncomeRange: Not displayed
## NULL
## --------------------------------------------------------
## loans2$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.083 0.083 856.000 1291.281 1507.000 9096.000
##
## Pearson's product-moment correlation
##
## data: loans2$BorrowerAPR and loans2$LenderYield
## t = 2322, df = 97711, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9909477 0.9911709
## sample estimates:
## cor
## 0.99106
##
## Pearson's product-moment correlation
##
## data: loans2$OpenCreditLines and loans2$StatedMonthlyIncome
## t = 91.229, df = 97711, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2743744 0.2859303
## sample estimates:
## cor
## 0.2801625
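Output in the shape shown above is what base R's `by()` and `cor.test()` print. The sketch below reproduces that shape on a tiny invented data frame; only the column names are taken from the output, and the values are assumptions. It also shows why some groups print `NULL`: a factor level with no rows has nothing to summarize.

```r
# Toy data. Values are invented; column names mirror the output above.
loans2 <- data.frame(
  IncomeRange = factor(
    c("$1-24,999", "$1-24,999", "$25,000-49,999",
      "$25,000-49,999", "$25,000-49,999"),
    levels = c("$0", "$1-24,999", "$25,000-49,999")
  ),
  StatedMonthlyIncome = c(1200, 1800, 2600, 3100, 3600),
  OpenCreditLines = c(3, 5, 6, 8, 9)
)

# Summary statistics of monthly income within each income range.
# The empty "$0" level prints as NULL.
by_income <- by(loans2$StatedMonthlyIncome, loans2$IncomeRange, summary)
print(by_income)

# Pearson correlation test between two numeric columns.
ct <- cor.test(loans2$OpenCreditLines, loans2$StatedMonthlyIncome)
print(ct)
```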
The most interesting observations I found in this data were not the relationships between variables, but the lack of a relationship where I was expecting one. One thing I was curious about was which credit scores default the most. Before plotting the chart, I pictured lower credit scores defaulting the most. However, since most of the loans are given to borrowers with scores in the vicinity of 700, these scores also account for the most defaulted loans, likely because of the share of borrowers they represent.
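The effect described above, where the largest group produces the most defaults simply because it is the largest, can be separated from actual risk by comparing default rates instead of default counts. The sketch below uses invented numbers and hypothetical column names (`score_bin`, `loans`, `defaults`); it is not a result from this data set.

```r
# Invented counts for illustration only; not taken from the loans data.
scores <- data.frame(
  score_bin = c("600-639", "640-679", "680-719", "720-759"),
  loans     = c(1000, 3000, 5000, 2000),
  defaults  = c(200, 450, 500, 120),
  stringsAsFactors = FALSE
)

# Share of loans in each bin that defaulted.
scores$default_rate <- scores$defaults / scores$loans

# Bin with the most defaults in absolute terms.
most_defaults <- scores$score_bin[which.max(scores$defaults)]

# Bin with the highest default rate.
highest_rate <- scores$score_bin[which.max(scores$default_rate)]

print(scores)
print(c(most_defaults = most_defaults, highest_rate = highest_rate))
```

In this toy example the bin around 700 has the most defaults, while the lowest bin has the highest default rate, so a chart of raw counts and a chart of rates can tell different stories.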
##
## Pearson's product-moment correlation
##
## data: loans2$CreditScoreRangeLower and loans2$StatedMonthlyIncome
## t = 63.256, df = 97711, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1923109 0.2043578
## sample estimates:
## cor
## 0.1983418
One thing I was very confident about was that people with higher bankcard utilization would be more likely to fail to pay on time. But my assumption was wrong again. In fact, this is one of the weakest relationships I explored, with a correlation of about -0.07, almost 0.
##
## Pearson's product-moment correlation
##
## data: loans2$BankcardUtilization and loans2$CurrentDelinquencies
## t = -21.313, df = 97711, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07426319 -0.06178106
## sample estimates:
## cor
## -0.06802478
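A correlation this close to 0 can still come with a tiny p-value, because the test statistic grows with the sample size. For a Pearson correlation, t = r * sqrt(df / (1 - r^2)). The sketch below plugs in the two numbers reported above; the variable names are my own.

```r
r  <- -0.06802478  # sample correlation reported above
df <- 97711        # degrees of freedom reported above (n - 2)

# t statistic for a Pearson correlation.
t_stat <- r * sqrt(df / (1 - r^2))

# Share of variance in one variable explained by the other.
r_squared <- r^2

print(round(t_stat, 2))     # about -21.31, in line with the reported t
print(round(r_squared, 4))  # under half a percent of the variance
```

With almost 100,000 observations, even a very weak relationship is statistically distinguishable from zero, so the size of the correlation says more here than the p-value does.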
When I moved to this country, I had to build a credit score, since I had none. It was a struggle, because no financial institution would even let me open a credit card with them, and how are you supposed to build a credit score if no one gives you the opportunity to build credit? Well, you start with a secured credit card, which essentially means paying the bank fees and interest to borrow your own money, but these payments are reported to the credit bureaus, and that’s how you start building a credit history. I knew that I would give 100% to pay the bank back, but telling them so did not mean anything, because they really didn’t know anything about me. Data speaks for itself.
Banks have performed this type of analysis thousands and thousands of times, likely in much more depth than this, so when they ask about your credit score, your current debt, your income and more, it is for a reason. Although you may think these factors do not apply to you because you know you will pay the loan back, the bank has no way of measuring your ability to pay just by listening to you say so. In a broader sense, these factors are accurate and help minimize losses for both borrowers and lenders.
From a technical perspective, most of the challenges I faced while building these plots came from finding a type of plot that depicts the data in a way that makes sense. I had the variables and knew what I was looking for, but building the plots was hard: I ended up with line graphs with far too much noise, bar graphs that made no sense, or scatter plots with lines of dots that were impossible to interpret. I also wanted to build some pie charts. I can build simple ones, but I was not able to build one with the data from this set, even after hours of researching online how to do it.
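For the pie chart, one possible approach is to count the rows in each category with `table()` and pass those counts to base R's `pie()`. This is only a sketch on invented data; the `IncomeRange` values below are made up, and a categorical column with many levels would probably need grouping first to stay readable.

```r
# Toy data. Values are invented; the column name mirrors the loans data.
loans2 <- data.frame(
  IncomeRange = c("$25,000-49,999", "$50,000-74,999", "$25,000-49,999",
                  "$100,000+", "$50,000-74,999", "$25,000-49,999"),
  stringsAsFactors = FALSE
)

# Count how many loans fall into each category.
income_counts <- table(loans2$IncomeRange)

# Label each slice with its category and percentage share.
shares <- round(100 * income_counts / sum(income_counts))
slice_labels <- paste0(names(income_counts), " (", shares, "%)")

pie(income_counts, labels = slice_labels, main = "Loans by income range")
```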